SPATIAL PROFILING OF PROTEINS USING HYDROPHOBIC MOMENTS 



Cross Reference to Related Applications 

This application claims the benefit of United States Provisional Application
Number 60/245,396, filed November 2, 2000.

Field of the Invention 

The present invention relates to the mathematical analysis of proteins and, 
more particularly, relates to the spatial profiling of proteins using hydrophobic moments. 


Background of the Invention 

A protein may be thought of as a string with beads on it. Each bead has a
particular color. For many proteins, there are 20 colors, or 20 different beads. The string
folds up in a certain way, which means that it ends up with a certain series of folds. When
profiling a protein, researchers attempt to determine the order of the colors of the beads
and where the beads are in three-dimensional space. These locations are important
because all of the bodily functions depend on this three-dimensional structure. An
important problem is determining how hundreds of thousands of proteins fold.

Many proteins are globular and form in an intracellular environment or
plasma, which are both aqueous environments. For these proteins, it can be assumed that
there are only two colors, blue and red. Blue beads (called "hydrophobic") do not like
water and red beads (called "hydrophilic") are attracted to water. When these types of
globular proteins fold up, all of the blue beads get in the center and the red beads are on
the outside of the protein. Consequently, the residues that like water are on the outside
and the residues that do not like water are on the inside. A protein formed in this manner
will have a hydrophobic core and a hydrophilic exterior.

The structure of globular proteins can actually be quite complex, and
contain substructures such as beta sheets, beta strands, alpha-helices, and other helices.
Because the structure of the protein affects the way that the protein interacts with its
environment (and vice versa), protein structures have been studied in detail. A
computational technique for studying proteins includes mathematically modeling protein
structure to determine primary, secondary, tertiary, and even quaternary protein
structures.

Many of these techniques examine details associated with proteins, such as
determining exactly where residues are or the exact order of residues. Few of these
techniques are suitable for analyzing an entire protein. Even fewer of these techniques
can accurately determine whether a man-made protein structure is or could be a real
protein.

Thus, what is needed is a better way of quantifying and analyzing protein 
structure and a better way to determine if an example protein structure is or could be a 
real protein. 

Summary of the Invention

Generally, the present invention provides a number of procedures to
spatially profile proteins by using hydrophobic moments. In all procedures, a
hydrophobicity distribution of a protein is shifted and normalized. This allows better
quantitative comparisons of proteins. In one procedure, a shape or profile of a curve of a
second-order moment of hydrophobicity is determined. This shape can then be used to
determine if an example protein belongs to a particular class of proteins, such as globular
proteins. A second procedure involves determining one or more ratios, such as the ratio of
a distance at which the second-order moment of hydrophobicity vanishes to the distance
at which a zero-order moment of hydrophobicity vanishes. The distance at which a peak
occurs in a profile of the zero- or second-order moment of hydrophobicity can also be
used for comparison. These techniques also help to determine if a protein belongs to a
globular or other class of proteins. For many of these techniques, a surface or profiling
contour can be chosen and used to accumulate hydrophobicities and to determine the
moments. These procedures can be combined to provide a good mathematical
determination of whether a protein belongs to a particular class of proteins. For globular
proteins in particular, the present invention reveals that many globular proteins exhibit
similar structural characteristics. This result may be used to easily determine if a decoy
protein (a man-made exemplary protein) is a globular protein or a poor structural
imitation.

A more complete understanding of the present invention, as well as further
features and advantages of the present invention, will be obtained by reference to the
following detailed description and drawings.

Brief Description of the Drawings

FIG. 1 is a flowchart of a method for spatially profiling proteins in
accordance with one embodiment of the present invention;

FIG. 2 is a table of the hydrophobicity values for amino acids;

FIG. 3 is a system for spatially profiling proteins in accordance with one
embodiment of the present invention;

FIG. 4 is a table containing proteins from the Protein Data Bank (PDB)
that were used in experiments involving an embodiment of the present invention;

FIG. 5 is a profile showing the second-order moment, determined through
use of an embodiment of the present invention, for the 1AKZ protein;

FIG. 6 is a profile showing the second- and zero-order moments,
determined through use of an embodiment of the present invention, for the 1AKZ protein;

FIG. 7 is a profile showing a view along one principal axis of the 1AKZ
protein;

FIG. 8 is a table that results when the 1AKZ structure is fixed and
hydrophobic values are randomly shuffled;

FIG. 9 shows a profile of results, obtained through use of an embodiment
of the present invention, for the smallest protein 1ORC;

FIG. 10 shows a profile of results, obtained through use of an embodiment
of the present invention, for the largest protein 1FEH;

FIG. 11 shows a table of results, obtained through use of an embodiment
of the present invention, for a number of proteins from the PDB;

FIG. 12 shows a profile of a view along one of the principal axes of the
protein 1LDM, with the ellipsoid intercept in the plane of the two other principal axes;
and

FIG. 13 shows a profile of typical results, obtained through use of an
embodiment of the present invention, for a man-made protein structure (a "decoy").



Detailed Description of Preferred Embodiments 

The present invention provides a tool for probing protein structure. This
tool may be used in such situations as protein folding, dynamic protein modeling or
analysis of protein structure. The present invention may be used to analyze any protein
but is particularly useful for analyzing proteins that form in an aqueous environment,
such as globular proteins. It turns out, as will be discussed in more detail below, that
globular proteins exhibit certain characteristics that can be determined by the present
invention. These characteristics can be used to analyze a protein or decoy (a man-made
protein) to see if it is a globular protein. Transmembrane proteins will have a different
profile signature, but may also be analyzed by the present invention.

Because globular proteins form in an aqueous environment, they have a
hydrophobic core and a hydrophilic exterior. A hydrophobicity scale can be used to
determine the hydrophobicity distribution of a protein. A hydrophobicity value is a value
that indicates the degree to which a residue is attracted to or repelled by water. The
resultant hydrophobicity distribution can be shifted and normalized, which places each
protein on a common mathematical basis for comparison. Without shifting the hydrophobicity
distribution, the ability to compare different proteins is significantly degraded. If the
hydrophobicity distribution is shifted but not normalized, the ratios disclosed herein can
still be compared. However, values of the moments cannot be compared.

After shifting and/or normalizing the hydrophobicity distribution, the 
adjusted zero- and second-order moments of the hydrophobicity distribution can be 
determined. The zero- and second-order moments are "adjusted" because they use a 
hydrophobicity distribution that is shifted or shifted and scaled. The shape or profile of 
the adjusted second-order moment can be used to determine if a protein is globular. All 
globular proteins studied to date exhibit a characteristic profile such that the adjusted 
second-order moment rises from zero to a high positive value, then passes through zero 
and becomes strongly negative. There is generally only one zero crossing after the high 
positive value, and the profile becomes strongly negative after the zero crossing. Any 
protein that does not exhibit this profile most likely is not a globular protein. 

Another technique that can be used to distinguish globular proteins from 
other proteins or decoys is the determination of a ratio of the distance at which the 
adjusted second-order moment of hydrophobicity vanishes and the distance at which the 
adjusted zero-order moment of the hydrophobicity vanishes (or vice versa). Another ratio 
that can be determined is a ratio of a distance at which a peak occurs in a profile of the 
zero-order moment of hydrophobicity and a distance at which the zero-order moment of 
hydrophobicity vanishes. Yet another ratio is a ratio between a distance at which a peak 
occurs in a profile of the second-order moment of hydrophobicity and the distance at 
which the second-order moment of hydrophobicity vanishes. For all globular proteins, 
both peaks of the zero- and second-order moments occur at the same distance from the 
centroid of the protein. Globular proteins tend to exhibit a certain range of these distance 
ratios. If a protein or decoy has a hydrophobicity ratio that is not within the range, then 
the protein or decoy is likely not a globular protein.

The "distance" discussed in the last paragraph is determined with
reference to the centroid of the protein, which is the center of mass of the protein when
each residue is assigned unit mass. Additionally, a surface is necessary to determine
the cumulative moments. A good choice of a surface for globular proteins is an
ellipsoidal surface. The ellipsoidal surface is used to determine the cumulative moment at
a particular distance from the centroid. The surface defines a volume that contains the
hydrophobicity distribution of amino acid residues.

Although the primary emphasis herein is placed on globular proteins, the
present invention may be used to analyze other proteins, such as extracellular or
transmembrane proteins, as well. For these proteins, suitable surfaces, such as spheres or
cylinders, may be utilized.

Referring now to FIG. 1, this figure shows a flow chart of a method 100
for spatially profiling proteins by using hydrophobic moments. Method 100 is used to
analyze a protein, analyze many proteins and/or determine if an exemplary protein
belongs to a class of proteins that have already been analyzed using method 100. Method
100 begins when the centers of residues are determined (step 110). The centers of
residues can be either the α-carbon location of the residue or the centroid of the residue.
The centroid of a residue can be determined by determining the center of mass of the
residue, when each atom is assigned a location and the location is assigned a mass value
of one. It is also possible to mix centroids, α-carbon locations (i.e., use the α-carbon
location of one residue and the centroid of another residue), and centroids of residues that
have atoms missing.

The centroid of the protein (step 115) is determined as the centroid of 
residue centroids. 
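
Steps 110 and 115 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the representation of a residue as a mapping from atom name to coordinates, the atom label "CA" for the α-carbon, and the function names are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

Vec = Tuple[float, float, float]


def centroid(points: Sequence[Vec]) -> Vec:
    """Center of mass of a set of points when every point has unit mass."""
    n = len(points)
    if n == 0:
        raise ValueError("need at least one point")
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def residue_centers(
    residues: Sequence[Dict[str, Vec]], use_alpha_carbon: bool = False
) -> List[Vec]:
    """Step 110: one center per residue.

    Each residue is a hypothetical {atom_name: (x, y, z)} mapping. When
    use_alpha_carbon is True and the residue has a "CA" entry, that location
    is used; otherwise the unit-mass centroid of whatever atoms are present
    is used, so residues with missing atoms are still handled.
    """
    centers: List[Vec] = []
    for atoms in residues:
        if use_alpha_carbon and "CA" in atoms:
            centers.append(atoms["CA"])
        else:
            centers.append(centroid(list(atoms.values())))
    return centers


def protein_centroid(centers: Sequence[Vec]) -> Vec:
    """Step 115: the centroid of the residue centers."""
    return centroid(centers)
```

Because the fallback applies per residue, a single call can mix α-carbon locations and centroids, as the text permits.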

In step 120, the hydrophobicity distribution is determined. Each residue is
assigned a hydrophobicity consensus value $h_i$. In this disclosure, a residue and an amino
acid will be treated as being fungible. A representative table of hydrophobicity values is
shown in FIG. 2 and discussed below. The zero-order moment of the amino acid
distribution of protein hydrophobicity is:

$$H_0 = \sum_i h_i. \qquad \text{(Eq. 1)}$$

It should be noted that this is also the net hydrophobicity of the protein (step 120 of FIG.
1).

The first-order moment of the hydrophobicity distribution is:

$$\vec{H}_1 = \sum_i h_i \vec{r}_i, \qquad \text{(Eq. 2)}$$

where $\vec{r}_i$ is a vector to the centroid of the $i$-th amino acid residue with hydrophobicity
consensus value $h_i$. The sum is over all $n$ amino acid residues. Since the zero-order
moment, $H_0$, or net hydrophobicity of the protein, is generally non-vanishing, the
first-order moment will depend upon the origin of the calculation. In connection with the
calculated moments of α-helices, Eisenberg (see Eisenberg et al., Faraday Symp. Chem.
Soc. 17, pp. 109-120, 1982; and Eisenberg et al., Nature 299, pp. 371-374, 1982, the
disclosures of which are incorporated herein by reference) had pointed out that the
first-order moment would be invariant if hydrophobicity differences about the mean
were calculated with respect to an arbitrary origin, as the following equation illustrates:

$$\vec{H}_1 = \sum_i (h_i - \bar{h}) \vec{r}_i, \qquad \text{(Eq. 3)}$$

with $\bar{h} = H_0/n$. Using the protein centroid as the origin of the moment expansion yields
this invariant value of the first-order moment, namely:

$$\vec{H}_1 = \sum_i h_i (\vec{r}_i - \vec{r}_c), \qquad \text{(Eq. 4)}$$

where

$$\vec{r}_c = \frac{1}{n} \sum_i \vec{r}_i. \qquad \text{(Eq. 5)}$$

The first-order moment calculated about the centroid of the protein is,
therefore, a measure of first-order hydrophobic imbalance about the mean. With the
inclusion of values of the solvent accessible surface area, $s_i$, for each of the residues, the
surface exposed first-order hydrophobic moment imbalance about the entire protein can
then be written:

$$\vec{H}_1 = \sum_i h_i s_i (\vec{r}_i - \vec{r}_c). \qquad \text{(Eq. 6)}$$

This could provide useful information with respect to the three-dimensional spatial
affinity of the tertiary protein structure and external structures with which it might
interact. Thus, these equations provide insight into protein structures. However, this
would not profile the hydrophobicity distribution within the protein interior.
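
The origin independence claimed for Eqs. 3 and 4 can be checked numerically. The sketch below is illustrative only; the function names and the toy hydrophobicity values and coordinates are assumptions, not data from the disclosure.

```python
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]


def centroid(r: Sequence[Vec]) -> Vec:
    """Eq. 5: unit-mass centroid of the residue positions."""
    n = len(r)
    return tuple(sum(p[k] for p in r) / n for k in range(3))


def first_order_moment(
    h: Sequence[float], r: Sequence[Vec], origin: Vec, about_mean: bool = False
) -> Vec:
    """Sum of h_i * (r_i - origin).

    With about_mean=False and origin=(0, 0, 0) this is Eq. 2; with
    origin=centroid(r) it is Eq. 4. With about_mean=True each h_i is replaced
    by h_i - mean(h), which is Eq. 3 evaluated from an arbitrary origin.
    """
    h_bar = sum(h) / len(h) if about_mean else 0.0
    return tuple(
        sum((hi - h_bar) * (ri[k] - origin[k]) for hi, ri in zip(h, r))
        for k in range(3)
    )


# Toy values, chosen only to exercise the identity.
h = [0.5, -1.0, 2.0]
r = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (3.0, 0.0, 4.0)]

eq3_origin_a = first_order_moment(h, r, (0.0, 0.0, 0.0), about_mean=True)
eq3_origin_b = first_order_moment(h, r, (10.0, -5.0, 2.0), about_mean=True)
eq4 = first_order_moment(h, r, centroid(r))
```

All three results agree to rounding error, whereas the unshifted Eq. 2 gives a different vector for each choice of origin whenever the net hydrophobicity is non-zero.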

Second-order moments provide the capability of spatially profiling the
hydrophobicity distribution of amino acid residues. Profiling the distribution of
hydrophobicity requires the choice of a profiling shape. Proteins come with all sorts of
overall shapes. To profile, one must choose a particular reference point (the centroid), an
appropriate coordinate system (the principal axes of geometry) and a shape representative
of the protein (such as an ellipsoidal shape for a globular protein). A representation that is
the simplest generalization of the shape of a globular protein is an ellipsoidal
representation. This representation can be generated from the molecular
moments-of-geometry, i.e., moments-of-inertia for which all amino acid residue centroids
are weighted by unity instead of by residue mass. The moments of geometry are obtained
as eigenvalues of the following moment-of-geometry matrix written in dyadic notation:

$$\mathbf{M}_2 = \sum_i \left( \mathbf{1}\,|\vec{r}_i - \vec{r}_c|^2 - (\vec{r}_i - \vec{r}_c)(\vec{r}_i - \vec{r}_c) \right), \qquad \text{(Eq. 7)}$$

where $\mathbf{1}$ is the unit dyadic. The calculation is performed with the centroid (determined by
using the amino acid centroids) of the protein as origin. The moments-of-geometry are
designated $g_1$, $g_2$, and $g_3$, with $g_1 < g_2 < g_3$. The ellipsoidal representation generated by
these moments is written as:

$$x^2 + g'_2 y^2 + g'_3 z^2 = d^2, \qquad \text{(Eq. 8)}$$

with $g'_2 = g_2/g_1$ and $g'_3 = g_3/g_1$. The coordinates, $x$, $y$, $z$, are written in the frame of the
principal-geometric-axes. Equation 8 determines a surface (step 135) that can be used to
profile the moments of the hydrophobicity distribution.
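
A sketch of how Eqs. 7 and 8 might be evaluated is given below, using only the standard library. Two choices here are assumptions of the example rather than part of the disclosure. The eigenvalues are obtained with the well-known closed-form trigonometric method for symmetric 3x3 matrices. The ellipsoidal distance of each residue is computed from the identity d_i^2 = r^T M_2 r / g_1, which follows from expanding M_2 in its principal axes and avoids computing eigenvectors explicitly. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def _centered(centers: Sequence[Vec]) -> List[List[float]]:
    """Residue positions measured from the protein centroid."""
    n = len(centers)
    c = [sum(p[k] for p in centers) / n for k in range(3)]
    return [[p[k] - c[k] for k in range(3)] for p in centers]


def geometry_matrix(centers: Sequence[Vec]) -> List[List[float]]:
    """Eq. 7: sum over residues of (1 * |r|^2 - r r), unit weights."""
    m = [[0.0] * 3 for _ in range(3)]
    for r in _centered(centers):
        r2 = sum(x * x for x in r)
        for a in range(3):
            for b in range(3):
                m[a][b] += (r2 if a == b else 0.0) - r[a] * r[b]
    return m


def symmetric_eigenvalues(m: Sequence[Sequence[float]]) -> Tuple[float, float, float]:
    """Eigenvalues (g1 <= g2 <= g3) of a symmetric 3x3 matrix."""
    p1 = m[0][1] ** 2 + m[0][2] ** 2 + m[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return tuple(sorted((m[0][0], m[1][1], m[2][2])))
    q = (m[0][0] + m[1][1] + m[2][2]) / 3.0
    p2 = sum((m[k][k] - q) ** 2 for k in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(m[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det = (
        b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
        - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
        + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0])
    )
    phi = math.acos(max(-1.0, min(1.0, det / 2.0))) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return (smallest, 3.0 * q - largest - smallest, largest)


def ellipsoidal_distances(centers: Sequence[Vec]) -> List[float]:
    """Value of d in Eq. 8 at which each residue lies exactly on the surface."""
    m = geometry_matrix(centers)
    g1 = symmetric_eigenvalues(m)[0]
    if g1 <= 0.0:
        raise ValueError("degenerate geometry (collinear or single point)")
    out = []
    for r in _centered(centers):
        quad = sum(r[a] * m[a][b] * r[b] for a in range(3) for b in range(3))
        out.append(math.sqrt(max(quad, 0.0) / g1))
    return out
```

For six points placed at (±2, 0, 0), (0, ±1, 0) and (0, 0, ±1), the matrix is diagonal with g1 = 4 and g2 = g3 = 10, so the points on the long axis lie at d = 2 and the others at d = sqrt(2.5).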

The ellipsoidal surface obtained by the choice of a particular value of $d$
enables the collection of the values of hydrophobicity for all amino acid residues of
number, $n_d$, lying within this surface. The consensus hydrophobicity scale of FIG. 2,
discussed in more detail below, can be used to assign individual hydrophobicities for each
residue.

The hydrophobicity distribution arises from the spatial distribution of
residues and their assigned values of hydrophobicity. The distribution of amino acid
hydrophobicity is, however, shifted (step 140) such that the net hydrophobicity of each
protein vanishes. This is done by subtracting the average hydrophobicity from each value
in the hydrophobicity distribution. Thus, when the surface described by $d$ encompasses all
of the residues, the shifted hydrophobicity distribution will yield a net hydrophobicity
value of zero.

It should be noted that it is not necessary to zero the net hydrophobicity 
when the last residue is collected. Optionally, one could profile the protein by zeroing out 
the zero-order moment (which is an indication of the net hydrophobicity up until a certain 
distance) at a location in the protein interior. 

Such shifting of the values of amino acid hydrophobicity eliminates the
zero-order moment from the distribution and, consequently, the dependence of the
second-order moment upon differences in net protein hydrophobicity. This provides a
basis for comparison of the hydrophobic moment profiles of the different proteins and,
consequently, a basis for comparison of their hydrophobic ratios.

The distribution is then optionally, but preferably, normalized (step 145) to
yield a standard deviation of one. This step enables comparison of the moment
magnitudes of different proteins.
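
Steps 140 and 145 amount to standardizing the per-residue hydrophobicity values. The sketch below assumes the standard deviation is taken over all residues of the protein (a population standard deviation); the function name and the `normalize` flag are hypothetical.

```python
import math
from typing import List, Sequence


def shift_and_normalize(h: Sequence[float], normalize: bool = True) -> List[float]:
    """Shift hydrophobicities to zero mean (step 140) and, optionally,
    scale them to unit standard deviation (step 145)."""
    n = len(h)
    if n == 0:
        raise ValueError("need at least one residue")
    mean = sum(h) / n
    shifted = [x - mean for x in h]
    if not normalize:
        return shifted
    sd = math.sqrt(sum(s * s for s in shifted) / n)
    if sd == 0.0:
        raise ValueError("all hydrophobicities are equal; cannot normalize")
    return [s / sd for s in shifted]
```

With `normalize=False` only the shift is applied, which corresponds to the case in which ratios remain comparable between proteins but moment magnitudes do not.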

The average hydrophobicity per residue collected within the ellipsoidal
surface specified by $d$ is then written (step 150):

$$H_0^d(d) = \frac{1}{n_d} \sum_{i<d} h'_i = \frac{1}{n_d} \sum_{i<d} \frac{h_i - \bar{h}}{\langle (h_j - \bar{h})^2 \rangle^{1/2}}. \qquad \text{(Eq. 9)}$$

Equation 9 is one way to create an adjusted zero-order hydrophobic moment. The
superscript, $d$, indicates that the moment has been divided by the number of residues, $n_d$.
Dividing by the number of residues is not necessary, but can be used to aid comparisons.
The prime designates the value of hydrophobicity of each residue after shifting and
normalizing the distribution. The term $(h_i - \bar{h})$ shifts the hydrophobicity distribution,
while the term $\langle (h_j - \bar{h})^2 \rangle^{1/2}$ normalizes the distribution. The subscript, $j$, and the
brackets, "< >", refer to an average over a different index from the subscript, $i$. When the
value of the surface $d$ is just large enough to collect all of the residues, the net
hydrophobicity of the protein vanishes (step 155). This value of $d$ assigns a "protein
surface" as a location of common reference. Calculations that are performed for each of
the proteins, as discussed in the Example section below, will examine increasing the
value of $d$ until all residues have been collected and the mean hydrophobicity vanishes.

The value of the second-order ellipsoidal moment per residue (step 160),
from residues lying within the ellipsoidal surface specified by $d$, is written:

$$H_2^d(d) = \frac{1}{n_d} \sum_{i<d} h'_i \left( x_i^2 + g'_2 y_i^2 + g'_3 z_i^2 \right) = \frac{1}{n_d} \sum_{i<d} h'_i d_i^2. \qquad \text{(Eq. 10)}$$

Equation 10 is one way to create an adjusted second-order hydrophobic moment. When
all residues fall within the ellipsoidal surface and are collected, the following results:

$$H_2^d = \frac{1}{n} \sum_i h'_i d_i^2 = \frac{1}{n} \sum_i \frac{h_i}{\langle (h_j - \bar{h})^2 \rangle^{1/2}} \left( d_i^2 - \overline{d^2} \right), \qquad \text{(Eq. 11)}$$

where:

$$\overline{d^2} = \frac{1}{n} \sum_i d_i^2. \qquad \text{(Eq. 12)}$$

The values of $H_0^d$ and $H_2^d$ are calculated for each protein with increasing
values of the surface defined by $d$.
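
The accumulation described by Eqs. 9 and 10 can be sketched as a loop over increasing values of d. The sketch assumes the shifted (and optionally normalized) hydrophobicities and the ellipsoidal distance of each residue have already been computed; the function name, the inclusive test `d_i <= d`, and the default step of one are choices of this example.

```python
import math
from typing import List, Sequence, Tuple


def moment_profiles(
    h_prime: Sequence[float], d_res: Sequence[float], step: float = 1.0
) -> List[Tuple[float, float, float]]:
    """Return rows (d, H0_per_residue, H2_per_residue) for d = step, 2*step, ...

    h_prime: adjusted hydrophobicity of each residue.
    d_res:   ellipsoidal distance d_i of each residue from the centroid.
    The profile stops at the first d that encloses every residue.
    """
    if len(h_prime) != len(d_res):
        raise ValueError("h_prime and d_res must have the same length")
    if step <= 0.0 or not all(math.isfinite(x) for x in d_res):
        raise ValueError("step must be positive and distances finite")
    profile: List[Tuple[float, float, float]] = []
    k = 1
    while True:
        d = k * step
        inside = [(hp, di) for hp, di in zip(h_prime, d_res) if di <= d]
        n_d = len(inside)
        if n_d:
            h0 = sum(hp for hp, _ in inside) / n_d            # Eq. 9
            h2 = sum(hp * di * di for hp, di in inside) / n_d  # Eq. 10
        else:
            h0 = h2 = 0.0  # no residue collected yet
        profile.append((d, h0, h2))
        if n_d == len(h_prime):
            return profile
        k += 1
```

For the toy input `h_prime = [1, 1, -1, -1]` with distances `[1, 2, 3, 4]`, the second-order moment per residue rises to 2.5 at d = 2, turns negative at d = 3, and the zero-order moment vanishes at d = 4.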

Once the zero- and second-order hydrophobic moments have been
determined, the distances at which peaks occur for the profiles of these moments may be
determined (step 165). The distances of the peaks are preferably determined as being
distances from the centroid of the protein. Some exemplary peaks and distances are
described below.

In step 170, the distance is determined at which the second-order
hydrophobic moment becomes zero. The distance $d_2$ is the value of $d$ for which $H_2^d$ has
changed sign, becoming negative, and $d_0$ the value for which $H_0^d$ vanishes. The protocol
under which $d_2$ is chosen such that all values of $H_2^d$ at larger values of $d$ are negative
gives a quick estimate of where the second-order hydrophobic moment vanishes. A more
accurate estimate would choose the value of $d$ for which the magnitude of the second-order
moment is smallest.
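
The quick protocol of step 170 can be sketched as a backward scan over a sampled profile. The row layout `(d, H0, H2)`, the names `d2` and `d0`, and the convention that the profile ends at the first d enclosing all residues are assumptions of this example.

```python
from typing import List, Optional, Tuple


def zero_crossing_distances(
    profile: List[Tuple[float, float, float]]
) -> Tuple[Optional[float], float]:
    """Return (d2, d0) from rows of (d, H0, H2) sorted by increasing d.

    d0 is the last sampled d, where all residues are collected and the net
    hydrophobicity vanishes. d2 is the smallest sampled d such that H2 is
    negative there and at every larger d; it is None if the profile does
    not end on a negative second-order moment.
    """
    if not profile:
        raise ValueError("empty profile")
    d0 = profile[-1][0]
    d2: Optional[float] = None
    for d, _h0, h2 in reversed(profile):
        if h2 < 0.0:
            d2 = d
        else:
            break
    return d2, d0
```

If the second-order moment dips below zero, briefly recovers, and then stays negative, this scan reports only the final crossover.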

In step 175, various hydrophobic ratios are determined. One possible ratio
is the ratio between $d_2$ and $d_0$ (i.e., R equal to $d_2/d_0$). Another ratio is the ratio between a
distance at which a peak of the zero-order moment of hydrophobicity occurs ($d_{0p}$) and a
distance at which the zero-order moment of hydrophobicity vanishes (i.e., R equal to
$d_{0p}/d_0$). A third ratio is the ratio of a distance at which a peak of the second-order moment
of hydrophobicity occurs ($d_{2p}$) and the distance at which the zero-order moment of
hydrophobicity vanishes (i.e., R equal to $d_{2p}/d_0$). The latter two ratios, as seen and
discussed below, are equal.

For globular proteins, these ratios should be comparable and act as
discriminative devices, which can include or exclude proteins from a set of representative
globular proteins.
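
The ratios of step 175 follow directly from a sampled profile once the two vanishing distances are known. In this sketch the row layout `(d, H0, H2)`, the dictionary keys, and the choice of the first maximum as "the peak" are assumptions.

```python
from typing import Dict, List, Tuple


def hydrophobic_ratios(
    profile: List[Tuple[float, float, float]], d2: float, d0: float
) -> Dict[str, float]:
    """Compute d2/d0, d0p/d0 and d2p/d0 from rows of (d, H0, H2).

    d0p and d2p are the distances at which the zero- and second-order
    moments reach their maxima in the sampled profile.
    """
    if d0 <= 0.0:
        raise ValueError("d0 must be positive")
    d0p = max(profile, key=lambda row: row[1])[0]
    d2p = max(profile, key=lambda row: row[2])[0]
    return {"d2/d0": d2 / d0, "d0p/d0": d0p / d0, "d2p/d0": d2p / d0}


# Toy profile in which both moments peak at d = 2.
toy = [(1.0, 0.5, 0.5), (2.0, 1.0, 2.5), (3.0, 1.0 / 3.0, -4.0 / 3.0), (4.0, 0.0, -5.0)]
ratios = hydrophobic_ratios(toy, d2=3.0, d0=4.0)
```

When both peaks fall at the same distance, as in the toy profile, the second and third ratios coincide.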

In step 180, results from examining the current protein can be compared
with results determined previously. This step allows a set of proteins to be determined
and a general profile that matches each of the profiles for the zero- and/or second-order
hydrophobic moments to be determined. Ranges of ratios for the set of proteins can also
be determined. If the protein being examined has profiles that are of a shape similar to the
general profile, then the current protein is assumed to belong to the class of proteins
defined by the set of proteins. Similarly, if the ratios for the current protein are within a
predetermined amount from the range of ratios obtained for the set of proteins, then the
current protein is assumed to belong to the class of proteins defined by the set of proteins.

In this manner, either single proteins or a set of proteins may be examined
and profiled or compared with the profiles or ratios determined from a training set of
proteins.
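
The ratio comparison of step 180 can be sketched as a range test with a tolerance. The function name and the `tolerance` parameter (standing in for the "predetermined amount") are hypothetical. The four training ratios are values quoted in the Examples section for 1AKZ, 1ORC, 1FEH and 1LBU; a real training set would use every profiled protein.

```python
from typing import Sequence


def within_class_range(
    ratio: float, training_ratios: Sequence[float], tolerance: float = 0.0
) -> bool:
    """True if `ratio` lies within `tolerance` of the range spanned by the
    ratios obtained for a training set of proteins."""
    if not training_ratios:
        raise ValueError("training set is empty")
    lo, hi = min(training_ratios), max(training_ratios)
    return lo - tolerance <= ratio <= hi + tolerance


# Ratios quoted in the Examples section (1AKZ, 1ORC, 1FEH, 1LBU).
training = [0.77, 0.70, 0.71, 0.76]
```

A candidate with a ratio of 0.75 is accepted, while one with a ratio of 0.50 is rejected even with a tolerance of 0.05.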

Referring now to FIG. 2, this figure shows the consensus values of
hydrophobicity for the twenty residues. The hydrophobicity values used for each of the
examples in the Examples section below were taken from this table.

Turning now to FIG. 3, an exemplary system 300 is shown that could be
used to perform the methods and apparatus of the present invention. System 300
comprises a compact disk 305, a computer system 310 that itself comprises processor 320
and memory 325, and a connection to a network (the network is not shown in FIG. 3).
Memory 325 comprises some or all of the elements used to perform the embodiments of
the present invention. As such, memory 325 will configure the processor 320 to
implement the methods, steps, and functions disclosed herein. The memory 325 could be
distributed or local and the processor 320 could be distributed or singular. The memory
325 could be implemented as an electrical, magnetic or optical memory, or any
combination of these or other types of storage devices. Moreover, the term "memory"
should be construed broadly enough to encompass any device or medium where
information can be read from or written to an address in the addressable space accessed
by processor 320. With this definition, information on a network is still within memory
325 of system 300 because the processor 320 can retrieve the information from the
network. It should be noted that each distributed processor that makes up processor 320
will generally contain its own addressable memory space.

It should also be noted that computer system 310 could be an
application-specific integrated circuit that performs some or all of the steps and functions
disclosed herein.

As is known in the art, the methods and apparatus discussed herein may be
distributed as an article of manufacture (such as compact disk 305) that itself comprises a
computer readable medium having computer readable program code embodied thereon.
The computer readable program code is operable, in conjunction with a computer system,
to carry out all or some of the steps to perform the methods or create the apparatuses
discussed herein. The computer readable medium may be a recordable medium (e.g.,
floppy disks, hard drives, compact disks, or memory cards) or may be a transmission
medium (e.g., a network comprising fiber-optics, the world-wide web, cables, or a
wireless channel using time-division multiple access, code-division multiple access, or
other radio-frequency channel). Any medium known or developed that can store
information suitable for use with a computer system may be used. The computer-readable
program code is any mechanism for allowing a computer to read instructions and data,
such as magnetic variations on a magnetic medium or height variations on the surface of
compact disk 305.

What has been shown so far is a tool for probing proteins and revealing
structures of proteins that have not been determined before. This tool also provides better
comparisons between proteins than what has come before. Because the benefits of the
present invention are hard to envision when equations are solely used, the following
Examples section provides a more visual and succinct description of results obtained by
using the present invention.

EXAMPLES 

Now that the methods of the present invention have been presented,
experimental results will be presented. For the experimental results, protein structures
were selected by keyword searches of the Protein Data Bank (PDB) and by examination
of entries in different SCOP classes. For more discussion on the latter, see Murzin et al.,
Journal of Molecular Biology 247, 536-540, 1995, the disclosure of which is incorporated
herein by reference. The objective was to choose a selection representative of different
sizes and different classes. Thirty protein structures were chosen in this manner. For an
internal check, two of the proteins chosen included 1CTQ and 121P, the same protein
with independently determined structures. Three additional proteins were also chosen
from the recently determined structure of the 30S ribosomal subunit. For more
information about the structure of the 30S ribosomal subunit, see Wimberly et al., Nature
407, 327-339, 2000, the disclosure of which is incorporated herein by reference. The PDB
identifications (IDs) and number of amino acid residues for each are listed in FIG. 4.
Finally, fourteen simple decoys as well as their native structures were also chosen for
examination. For more discussion of these decoys, see Holm et al., Journal of Molecular
Biology 225, 93-105, 1992, the disclosure of which is incorporated herein by reference.

Detailed results of profiling one of the structures, 1AKZ, are shown in
FIGS. 5 and 6. FIG. 5 shows the profile of the accumulated zero-order moment, $H_0(d)$,
and second-order moment, $H_2(d)$. FIG. 6 lists the moments per residue, $H_0^d(d)$ and
$H_2^d(d)$. As the distance, $d$, that defines the extent of the ellipsoid is increased, the first
residue falls within the ellipsoidal surface at a value of $d$ equal to 4 Angstroms. From
FIG. 5, one sees the second-order moment increase in value until it turns around, rapidly
becoming negative. At the one-Angstrom resolution of the calculation shown in FIG. 6,
the first negative value appears at $d_2$ equal to 23. The net hydrophobicity of the protein
becomes zero at $d_0$ equal to 30. The hydrophobic-ratio, $R_H$, has a value, therefore, of 23/30
equal to 0.77. The steep decrease of the ellipsoidal moment tapers off in the final range of
25 to 30 Angstroms. Both zero- and second-order moments peak at the same value of $d$
and this distance, at which the maximum occurs, can also be used as a feature for
comparison between different proteins.

FIG. 7 shows a view along one of the three principal axes of the protein
1AKZ. The projections of the amino acid centroids have been plotted as well as the
elliptical boundaries in the plane containing two of the principal axes. The ellipses have
been plotted for the value $d$ equal to 16, where the second-order moment is greatest, the
value of $d_2$ equal to 23, the value at which $H_2$ has just changed sign and the value $d_0$ equal
to 30, the value for which all amino acid residue centroids just fall within the ellipsoidal
surface. The latter is the point where the protein hydrophobicity vanishes. The region of
increasing $H_2$ reflects the predominance of the spatial distribution of residues comprising
the hydrophobic core. At larger values of $d$, the slowing of this increase and plunge to
negative values reflects the spatially increasing prevalence of hydrophilic residues. Such
regular behavior is required for the identification of $d_2$ and consequently for the
calculation of $R_H$. Keeping the 1AKZ structure fixed and randomly shuffling the
hydrophobicity values among the different residues yields the results shown in FIG. 8. It
is evident from examination of this table that a value of $d_2$ cannot be assigned from this
distribution of values of the second-order moment.
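
The shuffling control can be sketched as a permutation of hydrophobicity values over fixed residue positions. The function name and the optional `seed` parameter are assumptions of this example; the disclosure does not say how the random shuffle was generated.

```python
import random
from typing import List, Optional, Sequence


def shuffled_hydrophobicities(
    h: Sequence[float], seed: Optional[int] = None
) -> List[float]:
    """Return a random permutation of the hydrophobicity values.

    Residue coordinates are left untouched by the caller, so only the
    assignment of values to positions changes. The mean and standard
    deviation of the distribution are preserved, so any shift and
    normalization already applied remains valid.
    """
    rng = random.Random(seed)
    shuffled = list(h)
    rng.shuffle(shuffled)
    return shuffled
```

Profiling the shuffled values against the same ellipsoidal distances then gives a control profile for the same structure.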

FIGS. 9 and 10 show the second-order ellipsoidal moment profiles
obtained for the smallest protein, 1ORC, and the largest protein, 1FEH. 1ORC has been
profiled with a resolution of 0.25 Angstroms in FIG. 9. At this resolution, $R_H$ is equal to
0.68. Even though the scales of the axes of both figures differ significantly, the overall
profile shapes over the extent of the proteins are similar. Again, there is an initial increase
in the value of the second-order moment before plunging to negative values. The
hydrophobic ratios, $R_H$, of 1ORC and 1FEH are 0.70 and 0.71, respectively, for the one
Angstrom resolution used to obtain the entries listed in FIG. 11. These two example
proteins highlight the relative independence of the overall second-order moment profile
shape and hydrophobic-ratio with respect to differences in protein size.

All thirty protein structures that were tested exhibit similar spatial
behavior for either the accumulated second-order hydrophobic moment, H_2(d), or the
second-order moment per residue. The accumulated profiles are, however, somewhat smoother
and accentuate the plunge to negative values as the surface of the protein is approached.
FIG. 11 lists the value of the hydrophobic-ratio for each of the protein structures. All thirty
structures yield a mean value of the ratio equal to 0.75, with a standard deviation of
0.045. The numerator and denominator of R_H, d_2 and d_0, are also listed. This clearly shows
how d_2 increases with increasing protein size to provide comparable values of the ratio for
all thirty proteins. The value of d_0 scales roughly as a factor of two between the largest
and smallest proteins examined. This is as expected, since the ratio of the number of
amino acid residues of the largest to smallest protein structures is approximately equal to
600/70 and, consequently, (600/70)^(1/3) ≈ 2. The distance d_0 can be considered an
approximate measure of the linear extent of the protein. Consequently, the values of d_2 are
equal to a comparable fraction of the extent of each of the proteins, for all of the
structures.
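
By way of illustration only, the cube-root scaling argument above can be checked with a
few lines of Python. The function name linear_extent_ratio is hypothetical, and the sketch
assumes that protein volume grows in proportion to the number of residues, so that linear
extent grows as the cube root of the residue count.

```python
def linear_extent_ratio(n_large: int, n_small: int) -> float:
    """Ratio of linear extents of two compact proteins.

    Assumption (not a measured result): volume scales with the number of
    residues, so linear extent scales with its cube root.
    """
    return (n_large / n_small) ** (1.0 / 3.0)


# Approximate residue counts quoted in the text for the largest and
# smallest structures.
ratio = linear_extent_ratio(600, 70)
print(round(ratio, 2))  # approximately 2.05
```

The result is close to the factor of two observed for d_0 between the largest and
smallest proteins examined.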

FIG. 11 also shows the results of profiling the distribution with a spherical
instead of an ellipsoidal contour. The crossover between the positive and negative values of
H_2 is still well defined. Consequently, a value for the hydrophobic-ratio, R_H, can be
calculated. It can be noted that there is greater variability in the hydrophobic-ratio with
spherical profiling.
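
As a minimal sketch of how a hydrophobic-ratio might be read off a sampled profile, the
following Python locates the first crossover of the second-order moment from positive to
negative values and divides that distance by d_0. The function names, the linear
interpolation between samples, and the toy profile values are all assumptions made for
illustration; they are not taken from the figures.

```python
from typing import Optional, Sequence


def crossover_distance(d: Sequence[float], h2: Sequence[float]) -> Optional[float]:
    """First distance at which h2 passes from positive to non-positive.

    The crossing point is linearly interpolated between the two samples
    that bracket it (an illustrative choice). Returns None if the profile
    never crosses.
    """
    for i in range(1, len(d)):
        if h2[i - 1] > 0 and h2[i] <= 0:
            frac = h2[i - 1] / (h2[i - 1] - h2[i])
            return d[i - 1] + frac * (d[i] - d[i - 1])
    return None


def hydrophobic_ratio(d: Sequence[float], h2: Sequence[float], d0: float) -> Optional[float]:
    """Crossover distance of the second-order moment divided by d0."""
    crossover = crossover_distance(d, h2)
    return None if crossover is None else crossover / d0


# Toy profile (invented numbers): rises in the interior, then plunges.
d_samples = [0.0, 10.0, 20.0, 28.0, 32.0, 40.0]
h2_samples = [0.0, 3.0, 6.0, 3.0, -3.0, -20.0]
toy_ratio = hydrophobic_ratio(d_samples, h2_samples, d0=40.0)
print(toy_ratio)
```

The same routine applies whether the distances d were generated with an ellipsoidal or a
spherical contour, since only the sampled profile and d_0 enter the calculation.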

A few of the proteins require special attention. Three of the structures,
1PDO, 1LDM and 1FSZ, have extended arms that are away from the main body of the
protein. Collecting all residues to determine the value of d_0 yields a value that is not
representative of the protein bulk. Shifting the scale of residue hydrophobicity such that
the net hydrophobicity of the protein is zero when all residues of the bulk are collected
yields the values given in FIG. 11. FIG. 12 shows a view along one of the principal axes
of 1LDM with the ellipsoidal intercept in the plane of the two other principal axes. The
intercept has been drawn for the value d equal to 37, a value that does not include the
contribution from the structural arm.
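
The shift described above can be sketched as follows, under the assumption that a uniform
shift of the hydrophobicity scale is intended, so that the shifted values of the bulk
residues sum to zero. The function name, the in_bulk flags, and the numerical values are
hypothetical; how residues are assigned to the bulk or to a structural arm is left to the
user.

```python
from typing import List, Optional, Sequence


def shift_to_zero_net(h_values: Sequence[float],
                      in_bulk: Optional[Sequence[bool]] = None) -> List[float]:
    """Uniformly shift residue hydrophobicities so the bulk sums to zero.

    h_values : hydrophobicity assigned to each residue.
    in_bulk  : True for residues counted as part of the protein bulk;
               defaults to all residues. Residues outside the bulk are
               shifted by the same amount but do not affect the shift.
    """
    if in_bulk is None:
        in_bulk = [True] * len(h_values)
    bulk = [h for h, keep in zip(h_values, in_bulk) if keep]
    if not bulk:
        raise ValueError("at least one residue must belong to the bulk")
    mean_bulk = sum(bulk) / len(bulk)
    return [h - mean_bulk for h in h_values]


# Invented example: the last residue is treated as part of an extended arm.
h = [1.0, -0.5, 2.0, -1.5, 3.0]
flags = [True, True, True, True, False]
shifted = shift_to_zero_net(h, flags)
bulk_sum = sum(s for s, keep in zip(shifted, flags) if keep)
print(shifted, bulk_sum)
```

Only the shift is shown here; any normalization of the scale is omitted.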

Structure 1LBU exhibits slightly deviant behavior of H_2. There is a rapid
crossover to a negative value of the second-order moment at a value of d equal to 20. This
value remains negative, until at d equal to 23 it becomes marginally positive before
becoming negative again at d equal to 24 and thereafter. The two zero crossovers at d
equal to 20 and d equal to 24 yield a hydrophobic-ratio average of 0.76.
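
A profile with more than one crossover could be handled, for example, by collecting every
distance at which the moment turns negative and averaging the resulting ratios. The sketch
below is one possible reading of that averaging, not a statement of the procedure actually
used. The profile values and the value of d_0 are invented for illustration, so the printed
number is not the 1LBU result.

```python
from typing import List, Sequence


def downward_crossovers(d: Sequence[float], h2: Sequence[float]) -> List[float]:
    """Distances at which h2 is negative immediately after a non-negative sample."""
    return [d[i] for i in range(1, len(d)) if h2[i - 1] >= 0 and h2[i] < 0]


def mean_hydrophobic_ratio(d: Sequence[float], h2: Sequence[float], d0: float) -> float:
    """Average of (crossover distance / d0) over all downward crossovers."""
    crossings = downward_crossovers(d, h2)
    if not crossings:
        raise ValueError("profile has no crossover to negative values")
    return sum(c / d0 for c in crossings) / len(crossings)


# Invented profile sampled at d = 17..26: negative from 20, marginally
# positive at 23, negative again from 24 onward.
d_grid = list(range(17, 27))
h2_grid = [4.0, 5.0, 3.0, -1.0, -2.0, -1.5, 0.2, -3.0, -6.0, -9.0]
hypothetical_d0 = 30.0  # assumed value, for illustration only

print(downward_crossovers(d_grid, h2_grid))
print(mean_hydrophobic_ratio(d_grid, h2_grid, hypothetical_d0))
```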

Two of the ribosomal proteins, B_1FJF (chain B; protein S2) and D_1FJF
(chain D; protein S4), are the largest deviants with respect to the values of R_H for the
non-ribosomal proteins. On the other hand, C_1FJF (chain C; protein S3) yields a value
of R_H that is within the range of the other thirty values. C_1FJF makes no contact with
RNA at all and exhibits an α/β-domain, frequently found in different proteins, with
α-helices packed against a β-sheet.

Finally, ellipsoidal moment profiling has been performed on a simple
decoy set. Fourteen decoy structures and the fourteen corresponding native structures of
this set, each with more than one hundred residues, were downloaded from the Internet at
location http://dd.stanford.edu/download.shtml. Twenty-eight moment calculations were,
therefore, performed. A typical result is shown in FIG. 13. Visual inspection of the figure
clearly delineates the difference between the correct or native structure and the decoy
structure. Figures for all of the fourteen structures look essentially the same. All native
structures exhibit a second-order moment profile similar to what had been obtained for
the thirty PDB structures. Consequently, hydrophobic ratios can be calculated and they
span the range of values previously found for the thirty. The spatial transition to the
hydrophilic exterior of the native structures is significantly amplified by the second-order
moment. The decoys do not exhibit this plunge to negative values of the second-order
moment, nor is the relatively regular behavior in the protein interior reproduced.
Hydrophobic ratios cannot, therefore, be assigned to any of the decoy structures.
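
The distinction drawn above rests on visual inspection of the profiles. Purely as a
hypothetical illustration of how such an inspection might be automated, the following
Python sketch tests whether a sampled profile reaches a positive peak, rises fairly
steadily up to that peak, and ends at a negative value. The function name, the
rise_fraction threshold, and both toy profiles are assumptions introduced here and are not
taken from the decoy set.

```python
from typing import Sequence


def can_assign_ratio(h2: Sequence[float], rise_fraction: float = 0.8) -> bool:
    """Heuristic check for a native-like second-order moment profile.

    Returns True when the profile (a) has a positive maximum, (b) ends
    at a negative value, and (c) increases on at least `rise_fraction`
    of the steps leading up to the maximum. The threshold is arbitrary.
    """
    peak = max(range(len(h2)), key=lambda i: h2[i])
    if h2[peak] <= 0 or h2[-1] >= 0:
        return False
    steps = [h2[i + 1] - h2[i] for i in range(peak)]
    if not steps:
        return False
    rising = sum(1 for s in steps if s > 0)
    return rising / len(steps) >= rise_fraction


# Invented profiles: one native-like, one irregular with no plunge.
native_like = [0.0, 1.0, 2.5, 4.0, 5.0, 3.0, -2.0, -8.0]
decoy_like = [0.0, 0.5, -0.3, 0.8, 0.2, 1.0, 0.6, 0.9]

print(can_assign_ratio(native_like), can_assign_ratio(decoy_like))
```

A profile that fails this check would simply be left without a hydrophobic-ratio, in line
with the treatment of the decoy structures described above.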

The comparison of the second-order moment profiles of the native
structures with those of the decoy structures is revealing. The second-order moment
amplifies differences about the mean protein hydrophobicity. Profiles of the native
structures reflect the significant separation between the hydrophobic residues comprising
the core and the hydrophilic residues comprising the protein exterior. The decoy residue
distribution fails to mirror this separation. This suggests that moment profiling should
play an important role in recognizing the difference between native folds and decoy folds.
It should also play a role in validating predicted protein structures.

With respect to molecular dynamics and protein folding pathways,
profiling could be done at various points in the folding trajectory. One would then look
for trajectories that begin to exhibit a relatively smooth monotonic increase of the
second-order moment in the structural interior with the onset of a transition to negative
values near the exterior. It would then be of interest to see how close to the final native
structure such an identification occurs. After identification or selection of such a
trajectory, fine-tuning could then be observed or directed by examination of the
hydrophobic-ratio. Considering the native structure as the endpoint in the folding
trajectory, perhaps the moment regularities will provide not only constraints with respect
to the pathways selected but also a clue to the underlying processes responsible for such
selection.

The procedures described in this disclosure need not be restricted to
examination of globular proteins, but can be used in connection with the profiling of
proteins of diverse overall structure with the choice of an appropriate overall profiling
geometry.

Thus, what has been shown are techniques for determining profiles and
ratios for protein probing and analysis. In the case of globular proteins, heretofore unseen
characteristics and similarities between relatively diverse proteins have been shown.
Moreover, the present invention allows decoy and unrelated proteins to easily be
excluded from a group of already examined and similar proteins.

It is to be understood that the embodiments and variations shown and
described herein are merely illustrative of the principles of this invention and that various
modifications may be implemented by those skilled in the art without departing from the
scope and spirit of the invention. For instance, surfaces other than an ellipse, such as a
conical surface or cylindrical surface, could be used. Additionally, shifting could be used
without normalization.